1. Introduction
Cancer is the second leading cause of death globally; in 2018 there were approximately 9.6 million cancer deaths worldwide. These numbers are rising, and by 2040 up to 29.5 million new cases per year are expected. Regardless of the type of cancer, cancer cells are characterized by uncontrolled growth and invasiveness (WHO). They consume the space and nutrients that healthy organs require, which can trigger the malfunctioning of certain body systems (NIH).
Breast cancer is a type of cancer that starts in the breast and occurs almost entirely in women; 2.1 million new cases were estimated in 2018 (WHO). A total of 44,130 deaths from breast cancer were estimated to occur in 2021 (ASCO).
There is a need for new techniques to correctly diagnose and treat breast cancer.
The aim of this project is to develop an algorithm that correctly predicts whether a sample of breast cancer cell nuclei is benign or malignant. This is done by comparing several machine learning algorithms, taking into account their accuracy and their execution time.
2. Dataset
This dataset is hosted on Kaggle (Breast Cancer Wisconsin (Diagnostic) Data Set) and originates from the UCI Machine Learning Repository.
The dataset contains 30 features plus the ID number and diagnosis of the participants. The features were computed by measuring 10 parameters on each cell nucleus from a digitized image of a fine needle aspirate (FNA) of a breast mass. For each parameter, the mean, standard error and "worst" (mean of the three largest values) were obtained.
A breast fine needle aspiration (FNA) removes some fluid or cells from a breast lesion (a cyst, lump, sore or swelling) with a fine needle similar to that used for blood tests. The sample of fluid or cells (or tissue) is then examined (Borecky).
Real-valued features:
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension (“coastline approximation” - 1)
The main libraries required for the development of this project are shown in the chunk below.
# Libraries
library(rpart.plot)
library(tidyverse)
library(skimr)
library(ggpubr)
# Helper packages
library(ggplot2) # for awesome graphics
library(dplyr) # for data manipulation
library(visdat) # for additional visualizations
# Feature engineering packages
library(caret) # for various ML tasks
library(recipes) # for feature engineering tasks
The database used can be visualized with the table shown below. The sample ID and an empty column were eliminated, as they have no relevance to the experiment.
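The column-removal step itself is not shown above; below is a minimal base-R sketch of the idea. The column names `id` and `X` (the empty trailing column often present in the Kaggle CSV) are assumptions, and a two-row toy frame stands in for the real file.

```r
# Toy stand-in for the Kaggle CSV (assumed columns "id" and empty "X")
df <- data.frame(id = c(842302, 842517),
                 diagnosis = c("M", "B"),
                 radius_mean = c(17.99, 20.57),
                 X = c(NA, NA))

data <- df[, !(names(df) %in% c("id", "X"))]  # drop ID and empty column
data$diagnosis <- factor(data$diagnosis)      # target as a factor
```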
3. Exploratory data analysis
Variable type
First, a descriptive exploration of the variables was performed in order to understand them better and detect possible problems.
| Name | data |
| Number of rows | 569 |
| Number of columns | 31 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 30 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| diagnosis | 0 | 1 | FALSE | 2 | B: 357, M: 212 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| radius_mean | 0 | 1 | 14.13 | 3.52 | 6.98 | 11.70 | 13.37 | 15.78 | 28.11 | ▂▇▃▁▁ |
| texture_mean | 0 | 1 | 19.29 | 4.30 | 9.71 | 16.17 | 18.84 | 21.80 | 39.28 | ▃▇▃▁▁ |
| perimeter_mean | 0 | 1 | 91.97 | 24.30 | 43.79 | 75.17 | 86.24 | 104.10 | 188.50 | ▃▇▃▁▁ |
| area_mean | 0 | 1 | 654.89 | 351.91 | 143.50 | 420.30 | 551.10 | 782.70 | 2501.00 | ▇▃▂▁▁ |
| smoothness_mean | 0 | 1 | 0.10 | 0.01 | 0.05 | 0.09 | 0.10 | 0.11 | 0.16 | ▁▇▇▁▁ |
| compactness_mean | 0 | 1 | 0.10 | 0.05 | 0.02 | 0.06 | 0.09 | 0.13 | 0.35 | ▇▇▂▁▁ |
| concavity_mean | 0 | 1 | 0.09 | 0.08 | 0.00 | 0.03 | 0.06 | 0.13 | 0.43 | ▇▃▂▁▁ |
| concave.points_mean | 0 | 1 | 0.05 | 0.04 | 0.00 | 0.02 | 0.03 | 0.07 | 0.20 | ▇▃▂▁▁ |
| symmetry_mean | 0 | 1 | 0.18 | 0.03 | 0.11 | 0.16 | 0.18 | 0.20 | 0.30 | ▁▇▅▁▁ |
| fractal_dimension_mean | 0 | 1 | 0.06 | 0.01 | 0.05 | 0.06 | 0.06 | 0.07 | 0.10 | ▆▇▂▁▁ |
| radius_se | 0 | 1 | 0.41 | 0.28 | 0.11 | 0.23 | 0.32 | 0.48 | 2.87 | ▇▁▁▁▁ |
| texture_se | 0 | 1 | 1.22 | 0.55 | 0.36 | 0.83 | 1.11 | 1.47 | 4.88 | ▇▅▁▁▁ |
| perimeter_se | 0 | 1 | 2.87 | 2.02 | 0.76 | 1.61 | 2.29 | 3.36 | 21.98 | ▇▁▁▁▁ |
| area_se | 0 | 1 | 40.34 | 45.49 | 6.80 | 17.85 | 24.53 | 45.19 | 542.20 | ▇▁▁▁▁ |
| smoothness_se | 0 | 1 | 0.01 | 0.00 | 0.00 | 0.01 | 0.01 | 0.01 | 0.03 | ▇▃▁▁▁ |
| compactness_se | 0 | 1 | 0.03 | 0.02 | 0.00 | 0.01 | 0.02 | 0.03 | 0.14 | ▇▃▁▁▁ |
| concavity_se | 0 | 1 | 0.03 | 0.03 | 0.00 | 0.02 | 0.03 | 0.04 | 0.40 | ▇▁▁▁▁ |
| concave.points_se | 0 | 1 | 0.01 | 0.01 | 0.00 | 0.01 | 0.01 | 0.01 | 0.05 | ▇▇▁▁▁ |
| symmetry_se | 0 | 1 | 0.02 | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 | 0.08 | ▇▃▁▁▁ |
| fractal_dimension_se | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | ▇▁▁▁▁ |
| radius_worst | 0 | 1 | 16.27 | 4.83 | 7.93 | 13.01 | 14.97 | 18.79 | 36.04 | ▆▇▃▁▁ |
| texture_worst | 0 | 1 | 25.68 | 6.15 | 12.02 | 21.08 | 25.41 | 29.72 | 49.54 | ▃▇▆▁▁ |
| perimeter_worst | 0 | 1 | 107.26 | 33.60 | 50.41 | 84.11 | 97.66 | 125.40 | 251.20 | ▇▇▃▁▁ |
| area_worst | 0 | 1 | 880.58 | 569.36 | 185.20 | 515.30 | 686.50 | 1084.00 | 4254.00 | ▇▂▁▁▁ |
| smoothness_worst | 0 | 1 | 0.13 | 0.02 | 0.07 | 0.12 | 0.13 | 0.15 | 0.22 | ▂▇▇▂▁ |
| compactness_worst | 0 | 1 | 0.25 | 0.16 | 0.03 | 0.15 | 0.21 | 0.34 | 1.06 | ▇▅▁▁▁ |
| concavity_worst | 0 | 1 | 0.27 | 0.21 | 0.00 | 0.11 | 0.23 | 0.38 | 1.25 | ▇▅▂▁▁ |
| concave.points_worst | 0 | 1 | 0.11 | 0.07 | 0.00 | 0.06 | 0.10 | 0.16 | 0.29 | ▅▇▅▃▁ |
| symmetry_worst | 0 | 1 | 0.29 | 0.06 | 0.16 | 0.25 | 0.28 | 0.32 | 0.66 | ▅▇▁▁▁ |
| fractal_dimension_worst | 0 | 1 | 0.08 | 0.02 | 0.06 | 0.07 | 0.08 | 0.09 | 0.21 | ▇▃▁▁▁ |
After visually analyzing the distributions of all the features, no abnormalities were found. Other than the target variable, which is a factor, all variables are numeric. The variable type is correctly assigned for all features.
In addition, there are no missing values.
Dataset features
Target variable
The target variable indicates whether the cancer is benign (B) or malignant (M). As seen before, there are 357 benign samples and 212 malignant samples.
ggplot(data, aes(diagnosis, fill=diagnosis)) +
geom_bar() +
labs(x="Diagnosis", y="Number of patients") +
guides(fill=FALSE) +
scale_fill_manual( values = c("#686868","#9F2042"))
Subsequently, boxplots of all the features were created in order to visualize their importance for classification. These boxplots are grouped into mean, standard error and "worst" features.
Mean BOXPLOTS
Other than the fractal_dimension_mean boxplot, there seems to be a significant difference in the values of all features when comparing benign and malignant samples. Many samples appear to be outliers in the feature values. In addition, the range of values varies greatly between features. For example, the mean area of a sample can reach a maximum value of 2500, while the highest fractal dimension of a sample doesn't reach 0.1.
library(gridExtra)
p <- list()
for (j in colnames(data)[2:11]) {
p[[j]] <- ggplot(data=data, aes_string(x="diagnosis", y=j)) +
geom_boxplot(aes(fill=factor(diagnosis))) + guides(fill=FALSE) +
theme(axis.title.y = element_text(size=8)) +
geom_jitter(alpha = 0.2, width = 0.2) +
scale_fill_manual( values = c("#686868","#9F2042"))
}
do.call(grid.arrange, c(p, ncol=5))
Standard Error BOXPLOTS
The standard error features show less difference between the classes. As before, many outliers are present.
p <- list()
for (j in colnames(data)[12:21]) {
p[[j]] <- ggplot(data=data, aes_string(x="diagnosis", y=j)) +
geom_boxplot(aes(fill=factor(diagnosis))) + guides(fill=FALSE) +
theme(axis.title.y = element_text(size=8)) +
geom_jitter(alpha = 0.2, width = 0.2) +
scale_fill_manual( values = c("#686868","#9F2042"))
}
do.call(grid.arrange, c(p, ncol=5))
"Worst" BOXPLOTS
The “worst” features show a similar behaviour to that of the mean features.
p <- list()
for (j in colnames(data)[22:31]) {
p[[j]] <- ggplot(data=data, aes_string(x="diagnosis", y=j)) +
geom_boxplot(aes(fill=factor(diagnosis))) + guides(fill=FALSE) +
theme(axis.title.y = element_text(size=8)) +
geom_jitter(alpha = 0.2, width = 0.2) +
scale_fill_manual( values = c("#686868","#9F2042"))
}
do.call(grid.arrange, c(p, ncol=5))
Correlation
Because many features are derived from the same ten nuclear measurements, correlation is expected. As seen in the correlation heatmap, the highest correlation is found between the perimeter, area and radius.
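The heatmap code itself is not reproduced here; the following is an illustrative base-R sketch of such a correlation check, using simulated columns (the names and values are hypothetical stand-ins, not the real `data`):

```r
set.seed(123)
radius    <- rnorm(100, mean = 14, sd = 3.5)
perimeter <- 2 * pi * radius + rnorm(100, sd = 2)  # tracks radius closely
texture   <- rnorm(100, mean = 19, sd = 4)         # largely independent
m <- cor(cbind(radius, perimeter, texture))        # correlation matrix

# locate the strongest off-diagonal correlation
off <- m; diag(off) <- 0
idx <- which(abs(off) == max(abs(off)), arr.ind = TRUE)[1, ]
rownames(m)[idx]   # the radius/perimeter pair
```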
4. Division of data in training and testing & data preprocessing
Stratified sampling
As seen before, there are more benign than malignant samples. Stratified sampling was used in order to guarantee the same class proportions in the training and test sets as in the complete data set.
library(rsample)
set.seed(123)
split_strat <- initial_split(data,prop = 0.8, strata = "diagnosis")
datos_train <- training(split_strat)
datos_test <- testing(split_strat)
prop.table(table(data$diagnosis))        # full data
prop.table(table(datos_test$diagnosis))  # test set
prop.table(table(datos_train$diagnosis)) # training set
##
## B M
## 0.6274165 0.3725835
##
## B M
## 0.6283186 0.3716814
##
## B M
## 0.627193 0.372807
5. Feature Engineering
Some transformations were performed on raw data with the aim of improving the performance of the algorithms.
The recipe object was created in order to be able to pre-process data.
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 30
Imputation of missing values
There are no missing values, thus there is no need to impute.
Variables with variance close to zero
As seen in the table below, all predictors have a significant variance. Hence, no features were removed based on this criterion.
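As an illustration of the idea (the report presumably uses caret's nearZeroVar or recipes' step_nzv, which also consider frequency ratios), a bare variance check on toy columns looks like this:

```r
toy <- data.frame(informative = c(6.98, 11.70, 13.37, 15.78, 28.11),
                  degenerate  = rep(0.001, 5))   # essentially constant
vars <- sapply(toy, var)
near_zero <- names(vars)[vars < 1e-8]   # flag (near-)zero-variance columns
near_zero
```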
Standardization and scaling
The boxplots demonstrated the need for centering and scaling: the difference in scales can have a great impact on the model. The step_normalize function centers and scales the data.
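The effect of step_normalize can be reproduced with base R's scale(), which likewise subtracts the column mean and divides by the standard deviation (toy values shown, on the order of the area feature):

```r
x <- c(143.5, 420.3, 551.1, 782.7, 2501.0)  # area-like magnitudes
z <- as.numeric(scale(x))                   # center, then scale to sd 1
c(mean = mean(z), sd = sd(z))               # ~0 and 1 after the transform
```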
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 30
##
## Operations:
##
## Centering and scaling for all_numeric()
Once the recipe object has been created with the preprocessing transformations, they are learned with the training data and applied to the two sets.
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 30
##
## Training data contained 456 data points and no missing data.
##
## Operations:
##
## Centering and scaling for radius_mean, texture_mean, ... [trained]
datos_train_prep <-trained_recipe %>% bake(new_data = datos_train)
datos_test_prep <- trained_recipe %>% bake(new_data = datos_test)
A “glimpse” of the values of the features after applying the transformations is shown below.
## Rows: 456
## Columns: 31
## $ radius_mean <dbl> 1.1026662, 1.8353345, 1.5854321, -0.7630822, …
## $ texture_mean <dbl> -2.0368159, -0.3414951, 0.4568427, 0.2572583,…
## $ perimeter_mean <dbl> 1.27735149, 1.69374265, 1.57418479, -0.586926…
## $ area_mean <dbl> 0.9838048, 1.9037595, 1.5555920, -0.7567497, …
## $ smoothness_mean <dbl> 1.59709014, -0.82451799, 0.96398998, 3.330921…
## $ compactness_mean <dbl> 3.33607335, -0.48595341, 1.07505329, 3.457096…
## $ concavity_mean <dbl> 2.69220042, -0.01326369, 1.38895856, 1.947309…
## $ concave.points_mean <dbl> 2.58556012, 0.57146330, 2.08288679, 1.4885803…
## $ symmetry_mean <dbl> 2.30706285, 0.02186195, 0.98940170, 2.9771876…
## $ fractal_dimension_mean <dbl> 2.31733476, -0.88939910, -0.40635208, 5.04247…
## $ radius_se <dbl> 2.37985886, 0.47285546, 1.17168663, 0.3072245…
## $ texture_se <dbl> -0.52600515, -0.83843605, -0.74182673, -0.069…
## $ perimeter_se <dbl> 2.72216499, 0.25563868, 0.81964691, 0.2779709…
## $ area_se <dbl> 2.325575265, 0.686914706, 1.099058909, -0.280…
## $ smoothness_se <dbl> -0.19032973, -0.58560573, -0.27416595, 0.7224…
## $ compactness_se <dbl> 1.356680925, -0.683695689, 0.847154173, 2.805…
## $ concavity_se <dbl> 0.71209089, -0.41400693, 0.21812085, 0.804409…
## $ concave.points_se <dbl> 0.6825459, 0.2843963, 1.4417704, 1.1338896, -…
## $ symmetry_se <dbl> 1.172419370, -0.779156596, 0.261925750, 4.751…
## $ fractal_dimension_se <dbl> 0.913959900, -0.083344862, 0.306057486, 2.043…
## $ radius_worst <dbl> 1.8696583, 1.7894970, 1.4976275, -0.2823654, …
## $ texture_worst <dbl> -1.34090113, -0.35703897, -0.01398177, 0.1429…
## $ perimeter_worst <dbl> 2.30085932, 1.53384537, 1.34655126, -0.247826…
## $ area_worst <dbl> 1.96064959, 1.85181528, 1.42511569, -0.546512…
## $ smoothness_worst <dbl> 1.338856915, -0.379130952, 0.542497956, 3.468…
## $ compactness_worst <dbl> 2.69410378, -0.43561800, 1.11878871, 4.005450…
## $ concavity_worst <dbl> 2.15333090, -0.14259294, 0.87673281, 2.031285…
## $ concave.points_worst <dbl> 2.33628782, 1.11076543, 1.99054851, 2.2143529…
## $ symmetry_worst <dbl> 2.8082563, -0.2375773, 1.1824953, 6.1601542, …
## $ fractal_dimension_worst <dbl> 2.02437558, 0.29713613, 0.21389567, 5.1516732…
## $ diagnosis <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, …
Mean boxplots after centering and scaling
The boxplots below show the transformed mean features (all other features were also transformed).
p <- list()
for (j in colnames(datos_train_prep)[1:10]) {
p[[j]] <- ggplot(data=datos_train_prep, aes_string(x="diagnosis", y=j)) +
geom_boxplot(aes(fill=factor(diagnosis))) + guides(fill=FALSE) +
theme(axis.title.y = element_text(size=8)) +
geom_jitter(alpha = 0.2, width = 0.2) +
scale_fill_manual( values = c("#686868","#9F2042"))
}
do.call(grid.arrange, c(p, ncol=5))
6. Predictive models
The train control for all models was created using the cross-validation resampling method with 10 folds.
control_train<-trainControl(method = "cv",
number=10,
returnResamp = "all", #all resampled performance measures are saved
classProbs = TRUE, # class probabilities are computed
savePredictions = TRUE)
Model 1: Glmnet
Glmnet fits a generalized linear model via penalized maximum likelihood. The penalization parameters are alpha and lambda; the hyperparameter selection was left at its default. These hyperparameters help deal with correlated predictors, which is of importance in this project.
set.seed(123)
modelo_glm <- train(diagnosis ~.,
method="glmnet",
family="binomial",
trControl=control_train,
data=datos_train_prep,
metric="Accuracy")
modelo_glm$bestTune
## alpha lambda
## 5 0.55 0.007608795
The most important variable is radius_worst.
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 71 0
## M 0 42
##
## Accuracy : 1
## 95% CI : (0.9679, 1)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6283
## Detection Rate : 0.6283
## Detection Prevalence : 0.6283
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : B
##
Model 2: Random Forest
Random forest is based on constructing a multitude of decision trees and aggregating their predictions.
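A hedged base-R sketch of the ensemble idea follows; each "tree" is replaced here by a trivial threshold stump fit on a bootstrap sample, and predictions are combined by majority vote (the actual model below is fit with caret's rf method, and the simulated feature is hypothetical):

```r
set.seed(42)
x <- c(rnorm(50, mean = 12), rnorm(50, mean = 18))  # one radius-like feature
y <- rep(c("B", "M"), each = 50)

stump_vote <- function() {
  idx <- sample(seq_along(x), replace = TRUE)  # bootstrap resample
  cut <- mean(x[idx])                          # crude learned threshold
  ifelse(x > cut, "M", "B")                    # this stump's predictions
}

votes <- replicate(25, stump_vote())                      # 25-member ensemble
pred  <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(pred == y)                                           # ensemble accuracy
```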
set.seed(123)
hip <- data.frame(mtry=1:30) # Randomly selected predictors
modelo_rf <-train(diagnosis ~.,
method="rf",
trControl=control_train,
data=datos_train_prep,
tuneGrid=hip,
metric="Accuracy")
modelo_rf$bestTune
## mtry
## 1 1
The most important variable of the model is area_worst.
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 69 3
## M 2 39
##
## Accuracy : 0.9558
## 95% CI : (0.8998, 0.9855)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9048
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9718
## Specificity : 0.9286
## Pos Pred Value : 0.9583
## Neg Pred Value : 0.9512
## Prevalence : 0.6283
## Detection Rate : 0.6106
## Detection Prevalence : 0.6372
## Balanced Accuracy : 0.9502
##
## 'Positive' Class : B
##
Model 3: AdaBoost
The next algorithm used was AdaBoost. The hyperparameter selection for AdaBoost was left at its default; one of two boosting methods and the number of iterations can be chosen.
set.seed(123)
modelo_ada <-train(diagnosis ~.,
method="adaboost",
trControl=control_train,
data=datos_train_prep,
metric="Accuracy")
modelo_ada$bestTune
## nIter method
## 2 50 Real adaboost
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 70 9
## M 1 33
##
## Accuracy : 0.9115
## 95% CI : (0.8433, 0.9567)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : 6.062e-12
##
## Kappa : 0.8029
##
## Mcnemar's Test P-Value : 0.02686
##
## Sensitivity : 0.9859
## Specificity : 0.7857
## Pos Pred Value : 0.8861
## Neg Pred Value : 0.9706
## Prevalence : 0.6283
## Detection Rate : 0.6195
## Detection Prevalence : 0.6991
## Balanced Accuracy : 0.8858
##
## 'Positive' Class : B
##
Model 4: Support Vector Machines with Linear and Polynomial Kernel
The SVM training algorithm builds a model that assigns new examples to one category or the other. The classification performed can be linear or non-linear. In this case, the hyperparameters are cost, degree and scale; the chosen values are shown in the chunk below. The best model was the linear one.
set.seed(123)
hip_svmP <- expand.grid(C=c(0.001, 0.01, 0.1, 0.5, 1, 10),degree=c(1,2,3),scale=1)
modelo_svmPoly <- train(diagnosis ~.,
method = "svmPoly",
trControl = control_train,
data = datos_train_prep,
tuneGrid = hip_svmP,
metric = "Accuracy")
modelo_svmPoly$bestTune
## degree scale C
## 7 1 1 0.1
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 71 0
## M 0 42
##
## Accuracy : 1
## 95% CI : (0.9679, 1)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6283
## Detection Rate : 0.6283
## Detection Prevalence : 0.6283
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : B
##
Model 5: Naive Bayes
The Naive Bayes algorithm is based on applying Bayes’ theorem with strong (naive) independence assumptions between the features. The hyperparameters were left at their defaults. The results obtained are not as favorable as those of the other models, probably because Naive Bayes does not consider any kind of correlation between features.
set.seed(123)
modelo_nb <- train(diagnosis ~.,
method = "nb",
trControl = control_train,
data = datos_train_prep,
metric = "Accuracy")
modelo_nb$bestTune
## fL usekernel adjust
## 2 0 TRUE 1
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 66 2
## M 5 40
##
## Accuracy : 0.9381
## 95% CI : (0.8765, 0.9747)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : 1.718e-14
##
## Kappa : 0.8693
##
## Mcnemar's Test P-Value : 0.4497
##
## Sensitivity : 0.9296
## Specificity : 0.9524
## Pos Pred Value : 0.9706
## Neg Pred Value : 0.8889
## Prevalence : 0.6283
## Detection Rate : 0.5841
## Detection Prevalence : 0.6018
## Balanced Accuracy : 0.9410
##
## 'Positive' Class : B
##
Model 6: k-Nearest Neighbors
k-nearest neighbors is an algorithm that classifies new cases based on a similarity measure (the distance between neighbors). The hyperparameters were also left at their defaults.
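The mechanism can be illustrated with a tiny base-R implementation (a sketch only; the model below uses the kknn package, which adds kernel weighting): classify a new point by majority vote among its k nearest neighbours in Euclidean distance.

```r
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from new_x to every training row
  d <- sqrt(rowSums((train_x - matrix(new_x, nrow(train_x), ncol(train_x),
                                      byrow = TRUE))^2))
  votes <- train_y[order(d)[1:k]]        # labels of the k nearest points
  names(which.max(table(votes)))         # majority vote
}

train_x <- rbind(c(0, 0), c(0, 1), c(5, 5), c(6, 5))
train_y <- c("B", "B", "M", "M")
knn_predict(train_x, train_y, new_x = c(5, 4), k = 3)  # "M"
```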
set.seed(123)
modelo_kknn <- train(diagnosis ~.,
method = "kknn",
trControl = control_train,
data = datos_train_prep,
metric = "Accuracy")
modelo_kknn$bestTune
## kmax distance kernel
## 2 7 2 optimal
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 70 1
## M 1 41
##
## Accuracy : 0.9823
## 95% CI : (0.9375, 0.9978)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9621
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9859
## Specificity : 0.9762
## Pos Pred Value : 0.9859
## Neg Pred Value : 0.9762
## Prevalence : 0.6283
## Detection Rate : 0.6195
## Detection Prevalence : 0.6283
## Balanced Accuracy : 0.9811
##
## 'Positive' Class : B
##
Comparison of first models
The accuracy and kappa metrics across the 10 folds of each model are shown in the following table.
modelos <- list(GLM=modelo_glm, RF=modelo_rf, ADA=modelo_ada, SVM=modelo_svmPoly, NB=modelo_nb, KNN=modelo_kknn)
results_resamples <- resamples(modelos)
datatable(results_resamples$values)
The time taken to compute all the folds and the final model for each algorithm is shown in the next table.
Some models obtained an accuracy of 1 on one of the folds. For this reason, in order to decide which model performed best, the mean accuracy across all folds was computed.
metricas_resamples <- results_resamples$values %>%
gather(key = "modelo", value = "valor", -Resample) %>%
separate(col = "modelo", into = c("modelo", "metrica"), sep = "~", remove = TRUE)
metricas_resamples <- metricas_resamples %>% filter(metrica == "Accuracy") %>%
group_by(modelo) %>%
mutate(media = mean(valor)) %>%
ungroup() %>%
ggplot(aes(x = reorder(modelo, media), y = valor, color = modelo)) +
geom_boxplot(alpha = 0.6, outlier.shape = NA) +
geom_jitter(width = 0.1, alpha = 0.6) +
theme_bw() +
labs(title = "Validation: Mean Accuracy of the repeated-CV",
subtitle = "Models are ordered based on the mean") +
coord_flip()
metricas_resamples
The models with the highest accuracy are the SVM and ADA models. The SVM algorithm has the highest mean accuracy and takes less time to execute than the ADA algorithm, which is in fact the one that takes the longest. Their ROC curves are shown below; the AUC of both models demonstrates their perfect performance on the test set.
ROC curves of top models
library(ROCR)
predictions <- predict(modelo_ada, datos_test_prep %>% dplyr::select(-diagnosis))
pred <- prediction(as.numeric(predictions),as.numeric(datos_test_prep$diagnosis))
perf <- performance(pred,"tpr","fpr")
AUC <- as.numeric(performance(pred,"auc")@y.values)
plot(perf,colorize=TRUE)
text(0.5, 0.4, paste("AUC = ",round(AUC,4)))
title("ADA")
grid()
predictions <- predict(modelo_svmPoly, datos_test_prep %>% dplyr::select(-diagnosis))
pred <- prediction(as.numeric(predictions),as.numeric(datos_test_prep$diagnosis))
perf <- performance(pred,"tpr","fpr")
AUC <- as.numeric(performance(pred,"auc")@y.values)
plot(perf,colorize=TRUE)
text(0.5, 0.4, paste("AUC = ",round(AUC,4)))
title("SVM")
grid()
The models obtained proved to be accurate and precise. Nevertheless, the correlation problem was only addressed by the Glmnet algorithm. Next, a principal component analysis is performed in order to reduce the dimensionality of the data, which also reduces multicollinearity.
7. Predictive models with PCA
The same six types of models were created after applying the PCA transformation to the data.
obj_recipe <- obj_recipe %>% step_pca(all_numeric(),num_comp = 10)
trained_recipe2 <- obj_recipe %>% prep(training=datos_train)
trained_recipe2
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 30
##
## Training data contained 456 data points and no missing data.
##
## Operations:
##
## Centering and scaling for radius_mean, texture_mean, ... [trained]
## PCA extraction with radius_mean, texture_mean, ... [trained]
datos_train_prep2 <-trained_recipe2 %>% bake(new_data = datos_train)
datos_test_prep2 <- trained_recipe2 %>% bake(new_data = datos_test)
Comparison of models with PCA
library(rlist)
modelos2 <- list.append(modelos, GLMpca=modelo_glm2, RFpca=modelo_rf2, ADApca=modelo_ada2, SVMpca=modelo_svmPoly2, NBpca=modelo_nb2, KNNpca=modelo_kknn2)
results_resamples2 <- resamples(modelos2)
Metrics…
Time…
metricas_resamples2 <- results_resamples2$values %>%
gather(key = "modelo", value = "valor", -Resample) %>%
separate(col = "modelo", into = c("modelo", "metrica"), sep = "~", remove = TRUE)
metricas_resamples2 <- metricas_resamples2 %>% filter(metrica == "Accuracy") %>%
group_by(modelo) %>%
mutate(media = mean(valor)) %>%
ungroup() %>%
ggplot(aes(x = reorder(modelo, media), y = valor, color = modelo)) +
geom_boxplot(alpha = 0.6, outlier.shape = NA) +
geom_jitter(width = 0.1, alpha = 0.6) +
theme_bw() +
labs(title = "Validation: Mean Accuracy of the repeated-CV",
subtitle = "Models are ordered based on the mean") +
coord_flip()
As seen in the graph, the SVM and ADA models still outperform all other models. The first SVM model still has the highest mean accuracy, indicating that reducing dimensionality does not necessarily improve the model, regardless of the number of variables or their correlation. If execution time is taken into account, the GLMpca model is much faster and also performs well.
Finally, the top two algorithms (SVM and ADA) were used again to create two models that take the class imbalance into account. All the features were used.
8. Best predictive models with weights
The weight of each class has to be computed before implementing the algorithm.
datos_train_prep <- datos_train_prep[order(datos_train_prep$diagnosis),]
datos_train_prep[,order("diagnosis")]
## # A tibble: 456 x 1
## radius_mean
## <dbl>
## 1 -0.161
## 2 -0.292
## 3 -1.31
## 4 -1.68
## 5 -0.584
## 6 -0.667
## 7 -0.133
## 8 -0.615
## 9 -0.735
## 10 -0.300
## # … with 446 more rows
n1train <- length(which(datos_train_prep$diagnosis=="B"))
n2train <- length(which(datos_train_prep$diagnosis=="M"))
ntrain <- dim(datos_train_prep)[1]
n_classes <- 2
weight1 <- ntrain/(n_classes*n1train)
weight2 <- ntrain/(n_classes*n2train)
allWeights <- c(rep(weight1,n1train),rep(weight2,n2train))
Final comparison
modelos3 <- list.append(modelos2, ADAweights=modelo_ada3,SVMweights=modelo_svmPoly3)
results_resamples3 <- resamples(modelos3)
Metrics…
Time…
metricas_resamples3 <- results_resamples3$values %>%
gather(key = "modelo", value = "valor", -Resample) %>%
separate(col = "modelo", into = c("modelo", "metrica"), sep = "~", remove = TRUE)
metricas_resamples3 <- metricas_resamples3 %>% filter(metrica == "Accuracy") %>%
group_by(modelo) %>%
mutate(media = mean(valor)) %>%
ungroup() %>%
ggplot(aes(x = reorder(modelo, media), y = valor, color = modelo)) +
geom_boxplot(alpha = 0.6, outlier.shape = NA) +
geom_jitter(width = 0.1, alpha = 0.6) +
theme_bw() +
labs(title = "Validation: Mean Accuracy of the repeated-CV",
subtitle = "Models are ordered based on the mean") +
coord_flip()
metricas_resamples3
Surprisingly, the SVM model that takes the class proportions into account has the same mean accuracy as the original SVM model, but takes slightly less time to compute.
9. Conclusions
As seen, the algorithm that best predicted the outcome for this dataset was SVM, and the addition of class weights had a positive impact on the model. The ADA model also performs outstandingly but takes a long time to train. The time of execution is of great importance in some applications, so if time is relevant, the Glmnet model with PCA is also an appropriate choice: reducing the number of features reduces the execution time of the algorithm. The Glmnet model also has high accuracy and a high AUC in all folds performed. In the future, other algorithms can be tested or other hyperparameters chosen in order to find better results. Other approaches can also be taken to deal with multicollinearity.